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Abstract 

We present a time-explicit discontinuous Galerkin (DG) solver for the time-domain acoustic wave equation on 
hybrid meshes containing vertex-mapped hexahedral, wedge, pyramidal and tetrahedral elements. Discretely 
energy-stable formulations are presented for both Gauss-Legendre and Gauss-Legendre-Lobatto (Spectral 
Element) nodal bases for the hexahedron. Stable timestep restrictions for hybrid meshes are derived by 
bounding the spectral radius of the DG operator using order-dependent constants in trace and Markov 
inequalities. Computational efficiency is achieved under a combination of element-specific kernels (including 
new quadrature-free operators for the pyramid), multi-rate timestepping, and acceleration using Graphics 
Processing Units. 


1. Introduction 

Highly efficient solution techniques exist for high order finite element and Galerkin discretizations on 
hexahedral meshes. Presently, producing high quality hexahedral meshes for complex domains is a difficult 
and non-robust procedure. Hybrid meshes, which consist of wedge and pyramidal elements in addition to 
hexahedra and tetrahedra, have been proposed to leverage the efficiency of hexahedral elements for more 
general geometries. Such meshes are commonly produced by adding a transitional layer of wedges and 
pyramids to a structured hexahedral mesh and constructing an unstructured tetrahedral mesh to fill in the 
remaining space |40j . Recently, techniques have been developed to automatically generate conforming hex- 
dominant hybrid meshes with a high percentage of hexahedral elements [5], which we aim to leverage to 
develop efficient solvers on more general geometries. We note that the utility of high order time-explicit DG 
using hybrid (including polygonal) meshes has been explored for polynomial physical frame discretizations 
in |18] . We consider in this work approximation spaces defined under a reference-to-physical mapping. 

Spectral and Zip-finite element methods on hybrid meshes were explored early on by Sherwin, Karniadakis, 
Warburton, and Kirby imiiiiis] using orthogonal polynomial basis functions constructed for hexahedral, 
wedge, pyramidal, and tetrahedral elements. While approximation spaces for the hexahedron, wedge, and 
tetrahedron have remained relatively unchanged since their introduction, more recent work has focused 
on the construction of alternative basis functions and approximation spaces for the pyramid. Bedrosian 
introduced a set of low-order vertex functions for the pyramid which yielded polynomial traces, allowing for 
conformity with low order finite elements for other shapes [3]. Unlike other elements, however, his pyramidal 
shape functions were rational in nature. Bergot, Cohen, and Durufle extended this construction in nil] to 
produce a high order basis for the pyramid (which we refer to as the rational basis). It was additionally 
shown that, under non-affine mappings of the reference pyramid, the mapped rational basis contains a 
complete polynomial space. As a result, the rational basis provides optimal high order convergence rates 
on vertex-mapped elements, which is not true of the polynomial pyramid basis. Rational pyramidal bases 
were also adapted to the H (curl), H (div) setting and exact sequence spaces in [5J lil 133 ■ The construction 
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of exact sequence high order bases for elements of all types (including pyramids) is addressed in recent work 
by Fuentes, Keith, Demkowicz, and Nagaraj m- 

Our focus is on high order discontinuous Galerkin (DG) methods on hybrid meshes. We pay particular 
attention to the efficient implementation of solvers on hardware accelerators. The computation-intensive 
nature of time-explicit methods lends itself well to modern accelerator architectures such as Graphics Pro¬ 
cessing Units (GPUs), and was exploited by Klockner, Warburton, Bridge and Hesthaven to construct an 
efficient high order DG solver on tetrahedral meshes using a single GPU [5^. Exposing both coarse and hne- 
grained parallelism resulted in several times speedup for meshes with around a hundred thousand elements at 
high orders of approximation. Increases in on-board memory in modern GPUs have allowed for the solution 
of even larger problems on a single unit. Additional gains in efficiency and problem size may be achieved 
via multi-rate timestepping and distributed-memory parallelism EH HD]. However, these technologies have 
been developed on all-hex or all-tetrahedra meshes, and while hetergeneous computing on hybrid meshes 
has been explored in the context of fluid dynamics and the closely related Flux Reconstruction method [48] , 
GPU-accelerated DG methods on hybrid meshes have not yet been analyzed in detail. 

Grucial to the efficiency of GPU-accelerated DG methods is limiting the memory required. Since each 
GPU has a relatively small amount of on-device storage, maximum problem sizes are typically memory- 
bound. Classical techniques for hexahedral and tetrahedral elements limit the amount of data that must be 
stored, and recent developments in the construction of basis functions 113 Hi nil make it possible to develop 
low-storage DG methods on pyramids and wedges as well. This work leverages each of these technologies 
to construct a low-memory GPU-accelerated high order DG solver on hybrid meshes for the acoustic wave 
equation. Additionally, order-dependent global and local timestep restrictions for general hybrid meshes are 
derived using constants in discrete trace inequalities for each element. 

The structure of this paper is as follows: Section introduces a low-storage DG method for hybrid 
meshes. The low-storage treatment of mass matrices is addressed through a careful choice of basis functions 
for each specific type of element. The variational formulation and element-specific operations are given, and 
a computational implementation on many-core architectures is described. Section [^derives order-dependent 
timestep restrictions for each type of element based on the constants in discrete trace inequalities. Finally, 
Section [^ reports numerical and computational results. 


2. A Low-storage DG method on hybrid meshes 


Typical DG methods result in a system of ordinary differential equations of the form 


du 

dr 


= M-'^Au, 


where r is time, A is a linear operator defined by the discretization, and M is the global mass matrix, which 
is block diagonal in nature. Efficient implementations of DG must account for the inverse of M in a fast 
and memory-efficient manner. One common strategy is to store the factorization or explicit inverse of each 
block of M~^; however, on GPU and other accelerator architectures, on-device memory is typically limited 
to a few gigabytes. Explicit storage of factorizations can rapidly exhaust available memory and limit the 
maximum problem size on a single GPU at high orders of approximation. For example, assuming a mesh of 
1 million planar tetrahedra in single precision, at order A^ = 4, storage of M~^ already exceeds the 1 GB 
of memory available on early GPUs used to accelerate DG methods [26], without accounting for the storage 
cost of other relevant data. 

We consider meshes consisting of hexahedra, wedges, pyramids, and tetrahedra. Given an order N = 1 
basis on each element, we may then dehne low-order vertex functions. A map from the reference element to a 
physical element may then be defined by interpolating physical vertex positions with these vertex functions. 
We refer to such elements as vertex-mapped elements, and restrict ourselves to meshes consisting of such 
elements in this work. Furthermore, we define J to be the the determinant of the mapping Jacobian, such 


that 
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Figure 1: Types of elements present in hex-dominant meshes. From left to right: reference hexahedra, wedge, 
pyramid, and tetrahedra. 


where K, K are the physical and reference elements, respectively. 

2.1. Basis functions 

We will sidestep the cost of storage of M~^ by tailoring our choice of basis functions for each element. 
Assuming that all elements are vertex-mapped, we will construct basis functions for the hexahedron, wedge, 
and pyramid which yield diagonal mass matrices. For vertex-mapped tetrahedra, the mass matrix will not 
be diagonal, but will be a scalar multiple of the reference mass matrix. 

2.1.1. Hexahedra 

We take the reference hexahedron to be the bi-unit cube, with coordinates (a, 5, c) G [—1,1]^. The 
approximation space of order N on the hexahedron is simply the tensor product space of polynomials of 
order JV in each direction. For a vertex-mapped hexahedra H, this space and its dimension Np are given as 
follows 

Fjv(n) = {xyz^ 0<i,j,k<N}, Np = (N + y. 

We also have that J G Pi('H), and an appropriate quadrature rule can be constructed using a tensor product 
of standard ID Gauss-Legendre rules. If we take the orthogonal basis 

4>ijk{a,b,c) = £i{a)£j{b)ekic) 0 < i,j,k < N, 

where ii is the ith Lagrange polynomial at the ith Gauss-Legendre node, the mass matrix defined by 

£^ijk,lmn ~ I f^ijkf’lmn ~ / f’ijkf^lmnJ dx 
JV. Jn 

is diagonal for J G Pi {H). To apply we store only of the inverse of the diagonal of the full mass 

matrix, which requires the storage of NpK values, instead of storing the 0{NpK) entries required for the 
factorization of the global block-diagonal mass matrix. 

We may also achieve a diagonal mass matrix by taking Lagrange polynomials at tensor product Gauss- 
Legendre-Lobatto nodes, which includes points at both —1 and 1. Applying Gauss-Legendre-Lobatto quadra¬ 
ture results in a diagonal mass matrix, and is the basis behind the Spectral Element Method [371 [21]. We 
refer to the choice of basis using Gauss-Legendre nodes as the GL basis, and the basis using Gauss-Legendre- 
Lobatto nodes as the SEM basis. We will compare both of these choices in this work. 

2.1.2. Tetrahedra 

For the reference tetrahedron, an orthogonal basis may be given as the product of Jacobi polynomials on 
the bi-unit cube [MHHllIl] 


ykMc) = P°’°(a)(l - 6)*Pf+i’°(6)(l - 
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(a) SEM nodes 


(b) GL nodes 


Figure 2: SEM and GL nodes on the quadrilateral. For GL nodes, additional surface nodes must be specified 
(shown as squares). 


The tetrahedral basis is then given by the Duffy mapping from the bi-unit cube to the bi-unit right tetrahe¬ 
dron with coordinates (r, s, t) 


r = (1 -I- a) 


1 - b 




t = c. 


The approximation space for a vertex-mapped tetrahedra is the space of polynomials of total order N with 
dimension Np 


P{T) = {xYz\ 


i + j + k < N} , 


Np = 


(N + 1){N + 2){N + 3) 

6 


A convenient fact about vertex-mapped (planar) tetrahedra is that J is constant on each element. Thus, 
each local mass matrix is simply a constant scaling of the reference mass matrix, and only the reference mass 
matrix is stored. In our implementation, we choose a nodal Lagrange basis (dehned implicitly using the 
above orthogonal basis) where the nodes are placed at optimized interpolation points on the simplex |44| . 


2.1.3. Wedges 

An orthogonal basis for the reference wedge may be given in terms of Jacobi polynomials on the bi-unit 
cube 


Ma,b,c) = P°’°{a)P°'°{b) 


1 — c 




0 ^ h j ^ 0 < k < N — i. 


Applying a Duffy-type transform gives a basis on the reference bi-unit wedge with coordinates (r, s, t) 


r = (1 -I- a) 


1 — c 


- 1 , 


s = b, t = c. 


For a vertex-mapped wedge W, J G Pi(>V), and the approximation space of order N on the wedge and its 
dimension are 

PMm = {xVz\ 0<i,J<N, 0<k<N-i}, Np= ^^^ . 

We may construct an appropriate quadrature rule on the reference wedge based on the tensor product of 
Gauss-Legendre rules in the r, s coordinate, and a Gauss-Legendre rule with weights (1, 0) in the t coordinate. 
Alternatively, a wedge cubature may be constructed by taking the tensor product of an optimized cubature 
on the triangle with Gauss-Legendre cubature in the s coordinate. We take the latter approach in this work, 
using rules from Xiao and Gimbutas [49]. The triangle quadrature is chosen such that the rule is exact for 
polynomials of total order 2N J- 1, similar to Gauss-Legendre quadrature. 

Decreasing storage costs for the wedge mass matrix is less straightforward. One common approach is 
to construct a basis using the tensor product of triangular basis functions with Lagrange polynomials at 
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+ 1 Gauss-Legendre points. The resulting mapped basis is orthogonal in the b direction, and the local 
mass matrix becomes block diagonal, with each block corresponding to a triangular slice of the wedge. 
Unfortunately, for general vertex-mapped wedges, J is bilinear on each triangular slice, and each block of 
the local mass matrix corresponds to a J-weighted inner product on a triangle. Since there is clear basis 
which is orthogonal in this inner product for an arbitrary bilinear J, the blocks of each local mass matrix 
are typically dense and distinct from element to element. Factorizations of each block are required in order 
to efficiently apply the mass matrix inverse, resulting in increased storage costs. 

We may address this storage cost by using a Low-Storage Curvilinear DG (LSC-DG) approach [45ll46] . 
where we define the non-conforming basis (pijk 


4^ijk — 


4^ijk 

7J’ 


such that 


^ijk.lmn — /_ 4^ijk4^lmnJ dx 


/W 


/W 


4^ijk 4^1 mn 

'/j '/j 


J dx = M, 


ijk^lmn • 


As a result, each mass matrix is identical to the reference mass matrix, allowing us to apply M~^ while 
accounting for only a single mass matrix inverse over all elements. 

The use of LSC-DG results in a rational basis; consequentially, standard quadratures are not exact and 
aliasing errors may result from underintegration. To sidestep these issues, we restrict ourselves to variational 
formulations that may be written in a skew-symmetric fashion, such that DG is energy stable irregardless of 
the choice of quadrature [46) . Additionally, since our approximation space now contains rational functions, 
standard polynomial approximation results do not hold for LSC-DG wedges. Under a quasi-regular scaling 
assumption on elements in our mesh, we recover instead an approximation bound of the form 


llu - IfujUll 
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Vj 


VT 

L°-{K) 



WN + l.=o(/f) 




where the weighted projection IIioU is the projection onto the LSC-DG space, h is the size of the element K, 
and denotes the L°° Sobolev norm of order A^-|-l over K. For comparison, the projection error 

bound for curvilinear mappings using standard mapped bases replaces the Soblev in the above bound with 
an norm. In other words, while the approximation error for standard curvilinear elements depends only 
on the extremal values of J, it depends additionally on the smoothness of J when using rational LSC-DG 
basis functions. 


2.1.4- Pyramids 

Unlike hexahedral, wedge, and tetrahedral elements, the basis functions for the pyramid are rational in 
nature. For a vertex-mapped pyramid V, the approximation space and dimension Np are 


Bm{V)=P 



N-k 


; 0 < i -I- j < fc 


Np = 


{N +1){N + 2){2N + 3) 

6 ^ 


where Pn{P) is the space of polynomials of total order N on the physical pyramid. For a vertex-mapped 
pyramid, we have that the change of variables factor J G Bi{P). An appropriate quadrature rule is defined 
on the reference pyramid using a tensor product of Gauss-Legendre rules in the r, s coordinate, and a Gauss- 
Legendre rule with weights (2, 0) in the t coordinate. 

Since J is again non-constant, each local mass matrix is distinct and dense, increasing storage costs for 
DG methods on pyramids. However, it was shown in m that there exists an orthogonal basis (j)ijk on 
vertex-mapped pyramids which spans BjqiV). This basis is defined on the bi-unit cube as 


4ijk o) 


WiWj 


1 — c 




0 < fc < AT, 
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where £^{a),£j{b) are order k Lagrange polynomials at {k + 1) Gauss-Legendre nodes, and is the 

Jacobi polynomial of degree k with order-dependent weight 2k+ i. Wi is the ith Gauss-Legendre quadrature 
weight, and the normalization constant C^_f, = {N -|-2)/(2^^+^(2fc-|-3)). The pyramid basis is then defined 
under the Duffy-type mapping from the bi-unit cube to the bi-unit right pyramid 

r = (1-ha) - 1, s = (1 -h - 1, t = c. 

The mass matrix is diagonal for vertex-mapped pyramids under such a basis, and the entries of the mass 
matrix are the evaluation of J at the quadrature points ,bj. As with hexahedral elements, the application 
of M~^ requires only storage of the diagonals of each local mass matrix. 


2.2. Variational formulation 

We consider the acoustic wave equation on domain D with a free surface boundary condition p = 0 on 
the boundary of the domain d£l. This may be written in first order form 


1 9p ^ 

-L -I- V 

Kdr 


u = f 


du 


where p is acoustic pressure, u is velocity, and p and k, are density and bulk modulus, respectively. We 
assume also a triangulation of the domain iJ/j consisting of elements K with faces /, and that p and k are 
piecewise constant on each element. We will denote the union of the faces / as T^i. Let {p~,u~) denote the 
solution fields on the face of an element K, and let {p'^,u^) denote the solution on the neighboring element 
adjacent to that face. We may then define the jump of p and the normal jump and average of the vector 
velocity u componentwise 


lpj=p+ -p , 


[m] = M+ - M , 


{{«}} 


M+ + U 
2 


Defining n as the outward unit normal on a given face, the (local) variational formulation for the discon¬ 
tinuous Galerkin method is then 




u Vcj) dx + 


'K 


IdK 


— n 


IK 


dui f 

dx = — I Vp • 'ip~ dx + 


IK 


/ o N ■ 

IdK ^ 


{{«}} jf) dx 

■ - Ip]) ip~ ■ n~ dx. 


( 1 ) 

( 2 ) 


where Tp = 1/ {{pc}}, Tu = {{pc}}, and c^ = nf p is the speed of sound. Only one equation has been integrated 
by parts, resulting in a skew-symmetric and energy-stable formulation (the choice of which equation to 
integrate by parts is arbitrary; both produce skew-symmetric formulations). Proofs of energy stability 
typically rely on the exact evaluation of integrals [32] • Due to the use of LSG-DG on wedges, physical basis 
functions become rational in nature and are not exactly integrable using standard quadrature rules. The 
skew-symmetry of the variational form guarantees energy stability irregardless of quadrature rule [4fi] . 

While we adopt a skew-symmetric variational formulation when necessary for energy stability, we do not 
always compute with it. Instead, we use the fact that the skew-symmetric form (§ is sometimes equivalent 
to the twice integrated-by-parts “strong form” [22] 
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L 




; dr 
du _ 


dx 


W ■ u6 dx -I- 


/ I (xp IpI - 

JdK ^ 

[ Vp-ip~ dx+ [ i (r„ |m] 

Jk JdK ^ 


M)<(' dx 

— |p|) ■0“ • n~ dx. 
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Figure 3: The choice of nodal basis for the hexahedron determines the quadrature rule on quadrilateral faces 
of both pyramids (shown above) and wedges. 


This holds, for example, if integration by parts holds when volume and surface integrals are replaced with 
quadrature approximations, and is related to the “summation-by-parts” property in finite differences. 

The choice between the strong/skew formulations is governed by the choice between GL and SEM nodal 
bases for the hexahedron, which determines the quadrature rule on quadrilateral faces (see Figure [^. When 
SEM quadrature is used, surface integrals over the quadrilateral face are computed inexactly; as a result, 
the strong formulation may not be equivalent to the weak formulation, and may not be energy stable. 

For hexahedral elements, the strong and skew-symmetric forms are equivalent under both GL and SEM 
bases. This was shown by Kopriva and Gassner in [28] and for SEM, relies on the exact cancellation 
of Gauss-Legendre-Lobatto quadrature errors in volume and surface integrals. We may thus exploit the 
improved computational efficiency of the strong form for hexahedra for both SEM or GL nodal bases. For 
tetrahedra, since the mapping is constant for planar simplices, volume integrals are computed exactly using 
quadrature free techniques, and surface integrals are computed exactly using quadrature. Thus, the skew- 
symmetric and strong forms are equivalent for vertex-mapped tetrahedra, and we adopt the strong form for 
improved computational efficiency. 

For the wedge, we are restricted to the skew symmetric form, since integrals over the rational LSG-DG 
basis are inexact for any polynomial cubature rule. For the pyramid, the energy stability of the variational 
form depends on the choice of SEM or GL quadrature for the quadrilateral face. Under Gauss-Legendre 
quadrature, surface integrals for the pyramid are exact, and both the strong and skew-symmetric form are 
equivalent. However, if SEM quadrature is used for the quadrilateral face, the surface integral is inexact and 
we must use the skew-symmetric form for stability. 

We may now choose freely between a SEM or GL nodal basis for the hexahedron, which also determines 
the choice of quadrature on quadrilateral faces. SEM nodes tends to be more efficient than Gauss-Legendre 
nodes — with Gauss-Legendre quadrature, since the nodal points lie in the interior of the hexahedron, an 
additional evaluation step is necessary to compute the solution at points on the surface. Since the SEM 
nodal basis contains both volume and surface quadrature points as degrees of freedom, evaluation of the 
solution on the surface requires only retrieval of a single degree of freedom, and this extra step is avoided. 
Additionally, due to the Lagrange property of the SEM nodal basis functions, the surface contributions are 
sparser for the SEM formulation than for the GL formulation. 

However, it is also known that the inexact integration in SEM quadrature reduces the accuracy of the 
resulting solution [^, though the inner products generated by both underintegrated SEM quadrature and 
full Gauss-Legendre quadrature are known to be equivalent (with constants that do not grow in N). We will 
also show in Section |4T] that SEM underintegration can reduce the observed order of convergence for smooth 
solutions. Additionally, the use of SEM on hybrid meshes also restricts the pyramid to the skew-symmetric 
form, which is typically less computationally efficient than the strong form, while the use of Gauss-Legendre 
nodes on the hex allows for the use of the strong form for the pyramid. 

We naturally arrive at two energy stable formulations based on the two nodal bases for the hexahedron, 
which are summarized for each element type in Table We refer to the formulation using SEM hexahedra 
as the “SEM formulation”, and the formulation using Gauss-Legendre hexahedra as the “GL formulation”. 
We will compare the accuracy and efficiency of each formulation in Section 
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SEM Formulation 

GL Formulation 

Tetrahedra 

Nodal, strong form 

Nodal, strong form 

Pyramid 

Semi-nodal, skew-symmetric form 

Semi-nodal, strong form 

Wedge 

Modal LSC-DG, skew-symmetric form 

Modal LSC-DG, skew-symmetric form 

Hexahedra 

Nodal SEM, strong form 

Nodal Gauss-Legendre, strong form 


Table 1: Summary of stable bases and formulations for element type-specific variational forms. 


2.3. Element-specific operations 

Our numerical implementation tailors both the local variational formulation and computational opera¬ 
tions to each specific type of element. For each element type, we construct three kernels 

1. Volume kernel: compute the volume contribution (integrals over the interior of the element K). 

2. Surface kernel: compute the surface contribution (integrals over the surface of the element dK). 

3. Update kernel: apply the inverse of the mass matrix where necessary, execute a step of Adams- 
Bashforth, and evaluate/store the solution at cubature points on the surface. 

Algorithms describing the implementation of relevant kernels for each element type are given in the 
following sections. For the wedge, pyramid, and tetrahedron, surface and update kernels are very similar to 
those given in El for evaluation of the solution at surface cubature points. Their implementation differs 
only by the specific surface cubature, which is constructed by combining appropriate cubatures for triangular 
and quadrilateral faces. For triangular faces, we use quadrature rules for the triangle computed by Xiao and 
Gimbutas [33] that are exact for polynomials of degree 2N. For quadrilateral faces, we use a tensor product 
Gauss-Legendre or Gauss-Legendre-Lobatto (SEM) cubature rule with {N -|- 1)^ points, depending on which 
hexahedral nodal basis is chosen. 


2.3.1. Hexahedral elements 

The volume contribution for the hexahedron may be computed efficiently by exploiting both the Lagrange 
property of the nodal basis and the tensor-product form of the basis functions. Basis functions for the 
hexahedron are constructed 

(j)ijk{r,s,t) = ii{r)£j{s)£kit), 

where £i{r) is the Lagrange polynomial at the Ah node in the r direction, and similarly for £j{s), £k{t). This 
implies that the evaluation of volume integrals using quadrature reduces to 


f du 

AT+l 

X -^ T 

du 

du 

Ik dx 

/ ^ '^V ,m' ,n' Jv ,m' 

' ,7n' .n' — l 

dx 

^lmn\ii ytl' n' — '^ImnJlmn 


due to the Lagrange property of (j)imn at quadrature nodes. The values of derivatives ^ require computation 
of fy, fy at nodal points, which reduces to the computation of ID derivatives 


du 

dr 


Imn 


The integral is then scaled by = {wimnJimn) removing the weight and geometric factor multi¬ 

plying the derivative. This implementation is described in detail in Algorithm 

The computation of numerical fluxes requires the evaluation of the solution on the surface of a hexahedron. 
This is done in the update kernel for the hexahedron, and is described in Algorithm For a ID nodal basis 
at GL points, we may evaluate the solution at endpoints ±1 via 

Af-l-l N+1 

^( 1 ) — ^ ^ £k{ ^( 1 ) — ^ ^ 

k=l k=l 
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Algorithm 1 Algorithm for the hexahedron volume kernel (both SEM and GL). 

1; procedure Hexahedron Volume kernel 

2; Load nodal values Uijk (for l<i,j.,k<N+l) and the ID operator Dij into shared memory. 
3: for each node x^jk do 

4: Compute derivatives with respect to reference coordinate r, s, t. 


dujxjjk) 

dr 


N+l 

1=1 


Oui^Xijk') 

¥s 


Af+1 

^ ^ kdjlUilkj 
1=1 


dui^Xijk') 

Wt 


N+l 

Dkiuiji. 

1=1 


5: Scale by change of variables factors to compute fy, fy at point Xijk- 


du du dr du ds du dt 
dx dr dx ds dx~^ dt dx' 


where we have used symmetry of the GL points across 0. We store values of ffc(—1) in an array V/ = ffc(—1). 
For a SEM basis, ffe(—1) = 5ki^ and the evaluation of the basis on the boundary reduces to the loading of 


Algorithm 2 Algorithm for the GL hexahedron update kernel. 

1; procedure Update kernel 

2; Load nodal values Uijk (for 1 < i,j^k < N + 1) and into shared memory. 

3; for each 1 < i, j, fc < A + 1 do 

4: For face cubature points Xjk (face r = ±1), Xik (face s = ±1), and Xij (face t = ±1). 


N+l 

u(Xjk]r = — 1 ) = 'y ( V^Umjkj 
m—l 
iV+1 

^ — 1) — ^ ^ 
m—l 
N+l 

u{xij]t = — 1 ) = ^ ( V^Uijmi 
m—l 


N+l 

u{Xjk\r = 1) = 

m—l 

N+l 

S = 1) = ^ ^ 
m—l 
N+l 

u(Xij]t = 1) = y^ V^_^Umjk- 

m—l 


the proper nodal degrees of freedom on each face. 

The computation of the hex surface contribution differs slightly between SEM and Gauss-Legendre nodal 
bases. We compute contributions face by face, mapping surface integrals to the reference quadrilateral. For 
example, on the face corresponding to t = —1, this gives 


f f gijkuJ^ = f /fi(r)fj(s)4(-l) = WiXi;j4(-l) uJ® 

J r J s J r J s 


hi.Sj,-! 


where Wi, Wj are ID quadrature weights and J® is the determinant of the Jacobian mapping from the physical 
to reference quadrilateral face. For a Gauss-Legendre basis, ffc(—1) must be evaluated explicitly. For the 
SEM formulation, the evaluation of £/j(±l) may be skipped, and further optimizations may be done by 
noting that by the inverse mass matrix may be premultiplied into the geometric factors, and that due to 
the Lagrange property of nodal polynomials, surface RHS contributions for interior nodes are zero. This 
quantity is then scaled by the inverse mass matrix to compute the surface contribution to the RHS, and the 
full procedure is outlined in Algorithm]^ 
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Algorithm 3 Algorithm for the hexahedron GL surface kernel. 

1; procedure Hexahedron Surface kernel( replace u with an appropriate numerical flux.) 
2: for each 1 < *, j, fc < A + 1 do 

3: For faces r ± 1, compute the surface RHS contribution 

4: Compute similar contributions for faces s, t ± 1 


^ijk,ijk‘^i'^k J il) ^fe)) ^ijk,ijk'^^'^3 \ri,Sj Sj , ±1) . 


2.3.2. Tetrahedral elements 

Assuming Lagrange polynomials ij{r, s, t) defined by Np nodal points (r^, Si, U) on a tetrahedron, we may 
define operators which evaluate derivatives at those same nodal points 


= 

u 


irii li) 


D’^ = 


ir i •! ^itti^ 


t _ Si^ti) 


D' = 
o 


dr ds dt 

As shown in [22j , RHS contributions may be computed using only the above derivative operators 

JKdx^ ^ ^ dx ^ 


where x,u are vectors of the nodal positions Xi and nodal degrees of freedom Uj. 


Algorithm 4 Algorithm for the tetrahedron volume kernel. 

1; procedure Tetrahedron Volume kernel 
2: Load nodal degrees of freedom Uj into shared memory. 

3: for each node Xi, i = 1,..., Np do 

4: Compute derivatives with respect to reference coordinates r, s, t 


5: 


dujxj) 

dr 


ly p 

f=i 


du{xi) 

ds 


ly p 

E ’ 

i=i 


du{xi) 

dt 


ly p 

E 

i=i 


Scale by change of variables factors to compute fy, f)) i fy at point Xi. 


In the update kernel, we evaluate solution fields at cubature points on the surface. For nodal tetrahedra, 
this evaluation may also be done face-by-face to reduce cost at high orders, since traces of solution fields 
depend only on face nodal values. 


2.3.3. Wedge elements 

Evaluation of the skew-symmetric form involves the computation of two types of volume integrals 




71' 


Since derivatives in x (and similarly for y, z) of the LSC-DG basis are given by 


d^ _ _ 1 /del) (t> dJ\ 

dx dx -y/j 2J dx) ' 
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we precompute and store the values of the physical gradient VxyzJ at each cubature point. Derivatives of u 
and the first volume integral (pi ^ reduce to integrals over the reference element 






du 

dx 




i=i 


d(j)j (j)j dJ 
dx 2J dx 


Ik "dx 


Ik 


(pi du 
y/j dx 


J = 


Ik ''dx' 


Similarly, the second volume integral reduces to 

f d(p^. 


Ik 


dx 


u = 


Ik 


Ik 


dpi pi dJ 
dx 2J dx 
dpi dpi dpi 
dr' ds' dt' 


u = 


Ik 


dpi dr dpi ds dpi dt 1 9J' 
dr dx ds dx^ dt dx ''2J dx ^ 



dr 

ds 

dt 

1 dJ 



ax 


^2J dx_ 


For LSC-DG, we split the evaluation of the skew-symmetric formulation into the computation of the trial 
integrand (involving the solution) and the test integrand (involving test functions pi). Intermediate values 
used in the test integrand are stored in global memory. In the first step, derivatives of the solution are 
computed, and values of the solution at cubature points are premultiplied by appropriate change of variable 
factors, as indicated in the computation of the second volume integral above. In the second step, test functions 
and their derivatives in the reference r, s, t coordinates are evaluated at cubature points and summed up to 
compute all integrals. 

The splitting of the wedge volume kernel into two steps allows us to load geometric factors and derivatives 
of J only in the first step, while maintaining fast coalesced memory access patterns. Additionally, splitting 
into two kernels and writing intermediate values to global memory avoids the overuse of shared memory, 
which can reduce workgroup occupancy. 

Finally, since the wedge is a tensor product of triangle and line elements, we may additionally decompose 
interpolation and derivative operators into 2D triangle and ID operators. Exploiting the tensor product 
nature of these operators on the wedge leads to a reduction in the cost of quadrature. We assume that wedge 
basis functions may be decomposed as pi{r, t)pj{s), where pi{r, t) are basis functions over the triangle. Then, 


N+l 

w(DS,t) = E E 

i=i j=i 


where Uij are coefficients for the wedge, decomposed into triangle and ID indices i and j, respectively. Our 
cubature is defined also as a tensor product of a triangle and ID cubature rule with and {N + I) points, 
respectively. For cubature points x^i and weights wu, where k is the index for a triangle cubature point and 
I is the index for a ID cubature rule, we precompute the values of pi{r, t), pj{s) (as well as their derivatives) 


Fife — Pi{'l^k:tk): 


.rr,t _ dpi{rk,tk) 
^ik ~ 


dr, t 


Q<k< Nf, 


nlD dpjisi) 


0 < / < iV-h I, 


which are used in Algorithms and 

Compared to a sum-factorization approach, the tensor product sum in Algorithms does not decrease 
the total number of computations. We found that implementations of sum factorization required larger 
amounts of either shared or register memory, both of which decrease occupancy. Instead, we exploit the fact 
that triangle operators are constant for each ID summation, increasing data reuse. 

For Algorithm we again reuse loaded triangle operators, which are constant for each ID summation. 
Additionally, we may load premultiplied information and execute dot products in the summation using 
float4 data and operations, which are pipelined for faster execution on certain GPUs. 
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Algorithm 5 Algorithm for part 1 of the wedge volume kernel. 

1; procedure Wedge volume kernel Part 1 

2: Load coefficients Uij and ID interpolation/derivative operators into shared memory. 

3: for each cubature point Xk, k = 1,... Nc do 

4: Compute (using the tensor product form) the solution and reference coordinate derivatives 


5: 

6 : 


u{Xk) = 


K' N+1 




/ . ^kj 


du{xk) 
dr, t 


i=l 


i=i 


NT 


N+1 


i=l i=l 


du{xk) 

ds 


^ P iV +i 

E! E] ^kj 


Scale by change of variables factors to compute fy, ff at cubature point Xk. 

Write (to global memory) intermediate derivatives and premultiplied values at cubature points 


xyz ^ 


' dr ds dt 1 dJ 


Algorithm 6 Algorithm for part 2 of the wedge volume kernel. 
1; procedure Wedge Volume kernel part 2 
2: for each cubature point a;^, fc = 1,... do 

3: Load intermediate values at cubature points 


dr ds dt 1 dJ 


4: Compute integrals via cubature 



NT N+l 

v (u°) 




T 

li 


dujxki) 

dx 


Wkl 


IK 


dcjiij 

dx 




A+1 


z=i 


ID 




dr ds dt 1 dJ 


2.3.4- Pyramidal elements 

The form of the pyramid mapping and the semi-nodal pyramid basis lend themselves to simpler evaluation 
of integrals. We introduce a quadrature-free method of computing volume contributions, which improves 
upon the quadrature-based method for the pyramid basis described in m- Let u be a function in the span 
of the pyramid basis on the bi-unit reference pyramid V; then, 



Since we assume u is in the pyramidal space, it may be represented with degrees of freedom Uijk- Eval- 
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uating using the chain rule gives 


du da 




/^N 

^N-k 


Putting this together with we have 


du 

'Imn ^ 




dr 


/^N ^N 
^N-n^N-k 


EE^ 

i=0 j—0 


w^Wj da ^ 




Denoting 


D 


lmn,ijk 


Iw'^w^ di^iaf)^ 


Ic /c 

wfWj 


da 




p2fc+3p2n+3 f l-c\l+fc+" 
^N-k ^N-n \ 2 ) 


! r'N r'N 
y ^N-n^N-k 

we have that behaves as a weak derivative operator over the reference element 

f , du 


N k k 

^ E E E ^lm.n,ijk^ijk- 

k—0i—0j—0 


We may similarly define D® such that 

f ^ ^ 

./•p ds 


= D 




lmn,ijk 


wfwj 




db 


To take derivatives with respect to the t coordinate, we use the chain rule for ^ 

2 


du db 

du 

du 

db dt dc 

da 

du ! 

'l + b\ 

du 

dr \ 

. 2 J 

ds 


1 — c 


du f 1 + b 
db 


1 — c 


du 


An operator for the t derivative is constructed similarly. We first define a derivative operator in c 


D 


lmn,ijk 


n^..n 


d 




P^”+^(c) (if^) 


^\2+n 


k k 
wfWj 


dc 


! r'N 7t/v 

y '-'N-n^N-k 


Using this, D* may then be defined as 


D] 


Imn^ijk 


1 + a. 


D' 


lmn,ijk 


i + 6n 

_ l_ n® -I- n'^ 

2 I ^lran,ijk ' ^lmn,ijk' 


We now exploit the fact that both geometric factors and the determinant of the Jacobian J are constant 
in the c (and thus t) direction for all vertex-mapped pyramids [0 [TT]. As a result, geometric factors are 
constant in the index n. This implies that the integral of a derivative in the x coordinate over a physical 
pyramid may be given as 
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The above expression is simply multiplication by (square) semi-nodal operators and entry-wise 

scalings by geometric factors, similarly to nodal methods for hexahedral or tetrahedral elements. Further¬ 
more, the scaling by Jim is removed when applying the inverse of the diagonal mass matrix 


M, 


Imn.lmn 


1 

Jim 


When computing integrals in the skew-symmetric form, we may simply apply the transpose of each operator 


d(l)ir. 


dr 


'^J — ^ Jij^ijk- 

ijk 


and similarly for {D^)^ , In this case, the factor of J is not cancelled out after multiplying by M~^. 

The pyramid kernel is described in detail in Algorithm Similarly to the wedge, the algorithm is made up 
of two parts (computation of the trial and test portion). However, unlike the wedge, the semi-nodal nature 
of the pyramid basis allows us to store intermediate values efficiently in shared memory. 


Algorithm 7 Algorithm for the (skew-symmetric) pyramid volume kernel. 

1; procedure Pyramid Volume kernel 

2: Load semi-nodal degrees of freedom Ui into shared memory. 

3: for each semi-nodal basis function j = 1,..., Np do 

4: Store premultiplied Uj by geometric change of variables and mapping factors 


dr{xj) 
dx ' 




ds(xi) 
dx ' 


jJ{Xj), 


= 


dt(xj) 
dx ' 


jJ{Xj) 


5: for each semi-nodal basis function i = 1, ..., Np do 

6: Compute weak derivatives (no integration by parts) with respect to reference coordinates {r,s,t) 

using semi-nodal derivative operators 


IK 


du 

'Ih 








Ik 


du 
' ds 




= Ea- 


i=i 


Ik 


du 


Np 

i=i 


7; Scale by change of variables factors to compute weak physical derivatives and RHS contri¬ 

butions. 

'dudr[xi) ^ duds{xi) ^ du dt{xi)' 


IK 


du 


IK 


dr dx 


ds dx 


dt dx 


8: Compute weak derivatives (integrated by parts) with respect to physical coordinates (x, y, z) using 

transposed semi-nodal derivative operators and premultiplied degrees of freedom. Scale by the inverse 
mass matrix to compute RHS contributions. 


M, 


-1 


IK 




1 


d(j)i dr 
dr dx 


d(j)i ds 
ds dx 


d(f>i dt' 
dt dx , “ 


J{Xi 




3 = 1 


2.4- Many-core implementation 

The low-storage DC method described above has been implemented in hybridg, a portable, GPU ac¬ 
celerated solver for the acoustic wave equation on hybrid meshes. Portability between OpenMP, OpenCL 
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and CUBA platforms is achieved through the OCCA programming model [53]. The solution is evolved in 
time using a multi-rate Adams-Bashforth (MRAB) scheme. Support for multiple GPUs is facilitated through 
MPI, though load-balancing strategies for scalability remain to be investigated. The kernels for each element 
type are optimized according to the operations described above. Additional optimizations may be done by 


tuning the number of elements processed per workgroup (batching). In Section 4.3 we give results where 


optimal batch sizes are determined experimentally for each element type and approximation order N. 

Due to large variations in element sizes, as well as variations in element type-dependent trace con¬ 
stants, local timestep restrictions may vary significantly over the mesh. To sidestep overly restrictive global 
timesteps, we use multi-rate timestepping in the form of a 3rd order Adams-Bashforth scheme. The local 
timestep dr^ for each element is derived in Section]^ and the global timestep riTnim is taken as the minimum 
over all local timesteps. For the MRAB scheme, Meveis timestep levels are determined via 




1 ^ lev ^ A]e.^eig. 


Elements are then binned into each timestep level — a given element K is assigned a timestep of driev 
if driev-i < dTfc < driev A second sweep through the mesh moves elements from coarser timesteps to a 
finer timesteps in order to guarantee that neighboring elements differ only by one level at most (or that the 
timesteps of neighboring elements differ only by a factor of 1 or 2). The solution is then updated according 
to the coarsest timestep, and each timestep level is evolved times for every coarse timestep. 

Since the details of the implementation are independent of element type, we refer the reader to EH 1171151] 
for a description of the multi-rate scheme on triangular and tetrahedral meshes. 


3. Timestep restrictions 

Stable local and global timesteps are necessary for multi-rate schemes. We derive here timestep re¬ 
strictions for hybrid meshes by bounding the DG operator norm. These bounds are given in terms of 
order-dependent constants and physical/geometric quantities for each element type. 

To simplify notation, we define group trial and test variables C/, V 



Under time-explicit DG methods, the discretized acoustic wave equation results in a system of ODEs 

f ^ M-AC/). V-MU ^ (I + I ^r) 

where M is the scaled mass matrix. The global matrix A is given as the sum of local matrices A = 
where Ak is given by the local spatial discretization over an element K G fit 

V^AkU = ( f \7p■^f)~dx (4) 

Jk Jk 

^ W “ ”” ■ {{“W) 

We are interested in deriving bounds on the real and imaginary parts of the spectra of the DG operator in 
order to accurately determine a stable timestep restriction. For example, we may bound the timestep by 
dr < l/p{M~^A), the inverse of the spectral radius of M~^A. 

Prevous work has been done in estimating stable timesteps for structured grids [311112], and explicit 
information about the spectra of the discrete DG operator has been derived by Krivodonova and Qin in ID 
[291130] . We use an approach similar to Gohen, Ferrieres, and Pernet in m and derive bounds for p{M ^A) 
that depend on explict expressions for the constants in discrete trace and Markov inequalities. The first 
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bounds the norm of a function on the boundary dK of an element K by the norm of the function 
over the element 

while the latter bounds the Lp' norm of the gradient of a function by the Lp norm of the function in the 
interior 

11'^'*^11 L2(_fC) — K) ||li ||■ 

For SEM, the norm is replaced by the equivalent discrete norm computed using GLL quadrature. The 
mesh and order-dependent constants Ct{N,K),Cm{N,K) determine the timestep restriction required for 
stability. Since we wish to take the timestep as large as possible, we require sharp expressions for these 
constants in order to accurately estimate p{M~^A). 

We note that the bounds we derive assume polynomial basis functions. This assumption is violated for 
non-afSne mappings of the wedge, where we use rational LSC-DG basis functions. However, for such bases, 
we may use weighted Markov and trace inequalities from [45] in place of standard polynomial inequalities. 


3.1. Bounds on the spectra 

We derive here bounds on the spectra of the DG operator based on the constants in trace and Markov 
inequalities over reference elements. These bounds apply for both the GL formulation with full mass matrix 
M (or true inner product) and the SEM lumped mass matrix Msem (or discrete inner product based on 
GLL quadrature); the difference between the two is reflected in the constants in each inequality. 

The eigenvalues A and corresponding eigenvectors v of M~^A are given by the generalized eigenvalue 
problem Av = XAIv. We may derive a more detailed bound on the spectra of M~^A by decomposing the 
matrix into symmetric and skew-symmetric parts 

A={A^ + A'^), A^ = ^{A + A^), A'^ = ^{A-A^). 

From [T], we have that 

\^UM-^A^) < Re {X{M-^A)) < X^,^{M-^A^) 

A„,in(M-M'=) < Im(A(M-M)) < A^ax(M-iAl'=), 

In other words, the magnitude of the real and imaginary parts of the spectra are bounded by p{M~^ A‘^) and 
p{M~^A^). Since A® is symmetric and real, its spectral radius is given by the generalized Rayleigh quotient 

U^A‘^U 
Wmu ■ 

From the definition of A^ in Q, we can show that 


p{M-^A^) = max 


U'^A^U 



- ^ H {^pllbllli2(j) +r„||K]|lE2(y)}, 

/erh 


(5) 


where we have used the normal jump [itn] = u'^n^ + u n . We may bound the norm of the jumps 




L^U) 


|2 

Il2(/) ■ 


^IIK 


|^ 2 (/) < ||M+n+| 


i"(/) 


+ \\u n 


|2 

Il2(/) ■ 


and bound the sum of jumps and averages over faces by boundary values 


1 


E 5iiii>iiih„,< E E ii«p»iih(„ < E WPWh^idK) ■ 


/er^ 


KeQh 




KeQh 
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Combined with ([^ and changing to a sum over elements, we can apply trace inequalities to arrive at 

iCGSth 

< ^ CT{N,K)[r,^KM L^(K) + ll“llL2(iG)| j 


where Ct{N, K) is the constant in the trace inequality over the boundary dK of the element K and Tp^K, Tu^k 
are the maximum values of the penalties over the faces of an element. Combining these bounds, we have 


max 


U'^A^U 


^P,K Iblli2(^) +tu,k ||m| 

LHK)} 

U'^MU 

Y^Ken^ ‘ 

Kk 

llPWl^iK) + PK \\'^\\\-^(K) 



< maxC t(A^, isi) max Tp^xKK, —^—, 


K 


Tu,K 


PK 


The spectral radius of M~^A^ may be bounded by noting that skew-symmetric matrices are normal, 
which implies that p{M~^A^) = A characterization of this norm is given as follows: for a 

skew-symmetric matrix Z € 7?."^", 

||Z|| = max - 2 - 




+ II 5 II 


2 ■ 


This characterization has been proven in other settings (see, for example, |12j). A straightforward modifica¬ 
tion of this theorem to use the norm ||/||^ = y/ f^Mf gives that 


l,.r-l.fell 2V^A’^U 

\M ^A^W = max -^^ 

' " niL + II^IlM 


where the denominator ||C7||^ + ||C||^is 

II^^IIm + II^IIm = X! lbllL2(iC) + PK li“llL2(^) + — |k|lL2(^) + PK W^W^^t^K) 


From Q, we may compute the skew-symmetric form 

V^A^U= Y (/ uVv- f [ (-{{«}} K1 + {{"t}} b«]) 




where we have used the vector-valued jumps |p„] = p+n+ + p n and [u„] = + v n 

We may bound the volume terms using Markov and Young’s inequality 


E 

KGClh 


u • Vv — / Vp • T 


fK 


IK 


llVullj 


IVpIlj 


< E ii“ii 

Kecih 

< E 2'^^m{N,K) ^Ib||i2(^) + ||w|lL2(iC) + lkllL2(i^) 




KeQh 


< ^ ^/Cm{N,K) max ( kk,— ] 

2 \ PkJ 

where Cm{N,K) is the constant in the Markov inequality for the element K. 


17 














The surface terms may also be bounded using Young’s inequality 


E / (-{M}K1 + {M}W) 


/er^ ■ 


< 


E (|I{M}IIl3(/) IIK1IIl=(/) + II{M}IIl^(/) IIbnlllL^(/) 


/er^ 


-9 E (ll^^lli^iaA:) + II“IIl=(9a:) + ll'^lli^iaif) + ll'^llL2(aK)) 

KgQh 


KeQh 


1 


< - maxC'T(iV, K) max kk, — 


2 K 


1 


Pk 


2 

M ' 


) 


Combining these two bounds gives 

II,. 1.fell 2V^A’^U 

= max 


u,ven^ \\U\\l + \\V\\l ^ 


max max (^^/Cm{N,K) + Ct{N,K)') . 


3.2. Constants in trace inequalities 

An explicit expression for the trace inequality constant for a general d-simplex was given by Warburton 
and Hesthaven in m, and was extended by Hillewaert in his thesis to hexahedra and wedges in [53]. While 
the trace constant for the pyramid does not appear to be available in an explicit form, Hillewaert also 
proposed an empirical fit of the trace inequality constant for pyramids, which was based on a lower bound 
for the trace constant that could be derived in closed form. An improved asymptotically optimal constant 
was given by Chan and Warburton in m- The trace inequality is as follows: for u in the approximation 
space, the norm of u on a face / of the reference element AT, 

||w|li2('j) < Cf{N) ||u||^2(jf) • 

These constants were derived assuming triangular faces with area 2 and quadrilateral faces of area 4 on 
bi-unit reference elements. 

The constants in the trace inequality over the surface dK of a reference element may be given in terms 
of the constants over the different types of faces. Assume the trace inequality is derived for face /. Then, if 
we have Nf faces of a reference element A", 


Nf 

i|'“|Il2(o.^) = E ii^iii^(/j) 

i=i 


N 


\Iil 

fcl/l 


I1^IIl2(j) < 


Nf 

E 


i=i 


CfjN) 

I/I 


|/il \\u\\\ < max 


Cf{N) 

I/I 


dK 


2 

K ■ 


where we have used Holder’s inequality with p = 1 and q = oo and the fact that, under an affine transform 
(i.e. rotation), ||u||^. = |/j|/|/| HmUj. We will refer to Ct{N) as the constant in the trace inequality over the 
full surface dK, such that 




dK 


We summarize these trace constants Cf{N) and Ct{N) in Table[^ We remark that the constant Ct{N) for 
tetrahedra may be equivalently derived using the trace inequality from |47j 


p <A 

\dK - 


l){N + 3) \dK\ 
3 \K\ 


Ik 


when the element K is taken to be the bi-unit right tetrahedron K. 

The constants in Tablej^have the benefit of being fully explicit in N. However, since the values of Ct{N) 
are based on face-by-face estimates (instead of considering the whole surface dK), surface trace inequalities 
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Element type 

Triangular face 

Quadrilateral face 

C't(TV) 

Hexahedra 


(TV+ 1)2/2 

4(TV + 1)2 

Wedge 

(TV+1)2/2 

(TV+l)(TV + 2) 

(V2 + 3)(Af + l)(TV + 2) 

Pyramid 

(TV+l)(TV + 2)/2 

(TV+l)(TV + 3) 

{V2 + 2){N + 1){N + 3) 

Tetrahedra 

(TV+l)(TV + 3)/2 


(v^ + 3)(TV+ l)(TV + 3)/2 


Table 2: Summary of the trace constant Cf{N) in the discrete trace inequalities for different faces of various 
reference elements, as well as the analytic trace constant Ct{N) = maxf{Cf{N)/ \f\) dK 
dK. 


over the surface 



Figure 4: Comparison of face-based bound Ct{N) to numerically computed trace inequality constants on 
each reference element. 


under these constants are not tight. Figure [^compares numerically computed trace inequality constants to 
the derived analytic constants Ct{N) in Table For each element up to iV < 7, the estimated analytic 
constant is a factor of 2-3 larger than the computed constant. 

For the reference wedge, pyramid, and tetrahedra, we have not found explicit expressions for numerically 
computed constants in surface trace inequalities. However, for the hexahedron, we are able to derive that 
the constant in the bounds 

ll^llL2(ai?) — , ll“llsEM(aic) — C'sem(.^) ll^llsEM{/f) 

are given exactly by 

where the bound for the SEM inequality is valid for TV > 1. These constants are a refinement of explicit 
bounds given by Ern and Burman [5] and Evans and Hughes m, and proofs are included in 

Under a mapping from reference element K to physical element K, we may derive a bound in terms of 
the determinant of the Jacobian J and the determinant of the surface Jacobian J'* 


Appendix A.l 


l|■“llL2(^if) < C't(TV) II II J ^ llioo^lf) ||■w||i2(;^;) ■ (6) 

If the LSC-DG basis is used for the wedge, the polynomial trace inequalities used here may be replaced by 
trace inequalities for weighted polynomial spaces |46j , resulting in a bound of the form 


IKII L^(dK) — Ct{N) 


T 






For affine mappings of faces where J®, J are constant, the ratio of their norms reduces to the ratio of the 
physical surface area to reference surface area divided by the ratio of physical element volume to reference 
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element volume 




\dK\/ 


\K\/ 


dK 


K 


3.3. Constants in Markov inequalities 

Expressions for the constants in Markov inequalities over simplices are given in |36) : however, these 
bounds have not yet been extended to hexahedra, tetrahedra, and pyramids. We take a more heuristic 
approach here and compute numerically the constant in Markov inequalities over the reference element. We 
may bound 

2 

^ ^rst \\'^rstu\\j^2/g\ , 

L^(K) ^ ' 


llVull 




= E 


du 


dxk 


where C^si — ^^^rst,xyz hC^ z 

y Ox,y,z 

constants in the reference Markov inequality 


is the max norm of the Jacobian matrix for the element K. The 


II < Cm{N) ||'w||^2 ('^) 


may then be computed numerically over the reference element through the solution of an eigenvalue problem. 
The bound is completed by bounding reference quantities by physical quantities using J~^ 

||Vu||i2(^) < CMiN)Cl, II J||^<.(^) \MIhk) • 


3 . 4 . N-dependent bounds for p[M ^A) 

To summarize, the trace constant over a mapped element K is given by 

Ct{N,K) = CriN) II , 

while the Markov constant Cm{N, K) is given by 

Cm{N,K) = 

The real part of the spectra of M~^A is bounded by the trace inequality constant and physical parameters 
|Re(A)| < maxmax (^Tp^kKk, CriN) || J®||ioo(ajf) ■ (7) 

Likewise, the imaginary part of the spectra of M~^A depends on the quantity 

|Im(A)| < nmxmax [\/Cm{N,K) + Ct{N,K)^ . (8) 

The latter term expands to 

Crst^CM{N) ll^ll^ooj'u^ IIJ ^IIl°°(£) + Ct{N) II J llioo^aif^ II J |lL“(if)' 

We analyze the bounds derived above for the acoustic wave equation with k = p = 1 and compare them 
to numerically computed spectra of M~^A. Figure shows the computed spectra of M~^A for a hybrid 
mesh using the GL formulation. Bounds on the real and imaginary parts of the spectra derived by computing 
p{M~^A’^),p{M~^A^) are also included as dotted lines. 

We note that our estimates for p[M~^A^) provide a loose bound on the imaginary part of the spectra. 
However, we observe that the spectral radius of the symmetric part p(M~^A^) provides a relatively tight 
bound on p{M~^A). Motivated by this observation, we compare p(M~^A) to four different bounds based 
on p{M-^A^): 
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(b) Spectra and bounds 

Figure 5: Spectra for a hybrid mesh using Gauss-Legendre quadrature, along with bounds (dotted lines) 
computed from and p(A^), and analytic bounds (squared lines). The bounds for TV = 2, 3 are truncated 
for visualization purposes. 


1. The true spectral radius of M 

2. The bound 0 , where the trace inequality constant Ct{N, K) is computed numerically as the maximum 
trace constant over each mapped element K. 

3. The bound Q , where the trace inequality constant Ct{N) is computed numerically over the reference 
element K (^Appendix B). 


4. The bound Q, where the trace inequality constant Ct[N) is given by the estimate 0- 

In the latter two bounds, the L°° norm of and J® is approximated by taking the maximum value 

over quadrature points. Table shows each bound compared to p{M~^A) for various N. The bound on 
p{M~^A) using computed trace inequality constants over the reference element is less than a factor of 2 
away from the true spectral radius, and we use this to estimate the largest stable timestep for a given mesh 
and discretization. The bound on p{M~^A) using analytically derived trace constants over faces provides a 
looser bound (factor of 4-5 away from the true spectral radius), but has the advantage of being fully explicit 
in N. 

Bounds for the spectral radius under the SEM formulation may be derived by substituting in trace 
inequality constants using SEM quadrature over quadrilateral faces. Computed trace inequality constants 


Ct(TV) for both the GL and SEM formulations are given in Appendix B Analytic expressions for constants 
in trace inequalities may also be derived using norm equivalences between SEM and norms for polynomials 
of order N. 

Numerical experiments confirm that p {M~^A^ is bounded by the maximum trace inequality constant 
max^ Ct{N, K). Since a stable timestep is given by dr < 1/p (^M~^A'), we estimate the global timestep by 


dr = 


C 


C 


K)) ma^K {C^^,Ct{N)C^) ’ 
where = max {tp^k^k, , G is a tunable global CEL constant, and Cf is 

lL“( 9 f?) IL“(ff) element is a hex, pyramid, or tet, or 

if the element is an LSC-DG wedge. 


Gf = \\r 


'j 

-iK 


Cf = 


L°°[dK) 


Since we have estimated the global timestep by taking the maximum trace constant over all elements, we 
may also use local trace constants to estimate stable local timesteps drx using the constants in local trace 
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N 

1 

2 

3 

4 

p{M-^A) 

10.89 

18.66 

28.18 

41.09 

p{M-^A^) 

11.26 

19.30 

29.05 

42.86 

Gomputed CriN^K) 

12.22 

20.84 

32.76 

47.47 

Gomputed Ct{N) 

17.24 

29.47 

46.32 

67.13 

Analytic Ct{N) 

38.53 

74.73 

124.54 

186.81 


Table 3: Spectral radius of p{M ^A) and various bounds for a hybrid mesh and TV = 1,..., 4. 




(a) Wedge (b) Pyramid 

Figure 6: Subdivisions of hexahedra used to construct uniform meshes of wedges and pyramids. 


inequalities 




(9) 


4. Numerical experiments 

In this section, we present results from hybridg for meshes of individual element types and hybrid meshes 
at various orders of approximation. 

4--1. Verification: individual element types 

We begin by comparing the SEM and GL formulations for a series of uniformly refined meshes. Hex 
meshes are constructed by subdividing the unit cube into hexahedra of size h, and we construct wedge and 
pyramid meshes by further subdividing each hexahedra into either 2 wedges or 6 pyramids, as shown in 
Figure]^ Figure]^ shows numerical convergence rates obtained using the resonant cavity solution 

p{x, y, z, t) = sin(7ra;) sin(7r?/) sin(7r2:) cos(-\/37rr) 

over the unit cube [0,1]^. Since the solution is smooth, the best approximation in converges with a rate of 
though DG is guaranteed only a convergence rate of for general meshes. We observe optimal 

convergence rates for the GL formulation, while the SEM formulation yields computed convergence 
rates somewhere in between and (though for the hex, these rates are less informative since SEM 

errors do not follow a constant rate as closely). For all orders and meshes, the error for the GL formulation 
is lower than that of the SEM formulation, though this difference is less pronounced at higher N. 

We also compare standard wedges to LSG-DG wedges in Figurej^by performing convergence tests on two 
sequences of meshes. For one sequence, we refine a uniform mesh of affine wedges with the GL formulation 
(referred to as “GL Wedge”). For the other, we begin with a mesh of randomly warped wedges, and refine 
the mesh by bisection to produce a sequence of meshes (referred to as “GL LSC wedge”). Optimal rates of 
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(a) Hex 


(b) Wedge 


(c) Pyramid 


Figure 7: Convergence rates for both the SEM and GL formulations for meshes of hexes, wedges, and 
pyramids. 



Figure 8: Convergence of LSC-DG (right) on a mesh of warped wedges (left). The vertex positions are 
perturbed by 10%, and the mesh is then refined by splitting. The convergence of LSC-DG is plotted using 
a black line, while red lines compare the convergence of DC on an un-warped mesh of wedges. 


convergence are observed for both sequences (for the SEM formulation, similar behavior to standard SEM 
wedges is also observed when using LSC SEM wedges). We note that the observed optimal convergence 
rates for LSC-DG are likely dependent on the fact that refinement by bisection produces asymptotically 
afRne elements. If we perturb vertex positions at each level of mesh refinement, we do not in general observe 
optimal convergence rates. This issue is observed also for standard mapped approximation spaces [5], though 
it is likely exacerbated by the LSG-DG approximation. 

^.2. Verification: hybrid cube meshes 

We check also numerical convergence rates for unstructured hybrid meshes. Lf errors are computed on a 
sequence of hybrid meshes containing elements of all types . Each element is refined by a self-similar splitting, 
except for pyramids, which are refined into both pyramids and tetrahedraQ These meshes contain 201, 1704, 
14016, and 113664 elements, respectively. The breakdown of the number of each individual elements of each 
type is given in Table 5 MRAB levels are used, with a GFL constant of 0.5. Results are computed using 
double precision; single precision behaves similarly, though the error stalls at around 5 x 10“^. 


^Each element is refined according to the pattern specified by the “refine by splitting” option in GMSH m- 


23 














































Hexahedra 

Wedge 

Pyramids 

Tetrahedra 

Total elements 

Mesh 1 

84 

10 

24 

83 

201 

Mesh 2 

672 

80 

96 

856 

1704 

Mesh 3 

5376 

640 

384 

7616 

14016 

Mesh 4 

43008 

5120 

1536 

64000 

113664 


Table 4: Number of elements of each type for each hybrid mesh used in convergence tests. 



Figure 9: An exploded view of the hybrid mesh with 201 elements (left) and errors for both the SEM 
and GL formulation (right). 


Under the assumption that the mesh size h is halved at each time, we estimate the asymptotic convergence 
rate in Figured for = 1, 2, 3. For both the SEM and GL formulations, we observe orders of co nverg ence 
(in between and similar to those reported for single element-type meshes in Section 


4.1 


4 ..3. Computational results 

In Section 4.3.1 and 4.3.2[ we quantify the computational cost of DG solvers on hybrid meshes in terms 
of runtime per degree of freedom. While the approximation power per degree of freedom varies from element 
to element, the cost per dof gives a rough estimate of the efficiency of each kernel. In Section |4.3.3[ we 
quantify the computational efficiency of the solver in terms of estimated bandwidth and GFLOPS for each 
element kernel. All computations were run on a single Nvidia GTX 980 GPU in single precision. 


4 . 3 . 1 . Cost of SEM vs CL formulations 

Since the computational structure of the volume kernel is identical between the SEM and GL formula¬ 
tions, any additional cost associated with the GL formulation lies in the surface and update kernels. We 
implemented two separate surface and update kernels and report the relative per-dof speedup of SEM com¬ 
pared to GL for orders N = 1,..., 5 in Table Eor all orders, the cost of the SEM update is lower or equal 
to the cost of the GL update. For low orders, the SEM-tailored surface kernel performs slightly worse than 
that of the GL kerneQ However, as the order N increases, the SEM kernel becomes cheaper due to the fact 
that data from interior nodes does not need to be accessed. When comparing total runtime at A^ > 1, the 
SEM formulation becomes roughly 5— 10% cheaper per-dof than the GL formulation, similar to the 10 — 15% 
cost savings using SEM over GL as reported by Kopriva and Gassner on GPU architectures [5S]. 


^The slower runtime may be due to additional conditional statements and the structure of memory accesses in the SEM- 
tailored kernel. We note that it is always possible to run the SEM formulation using GL kernels, and thus the cost of the SEM 
surface kernel can always be made in practice to be less than or equal to the cost of the GL kernel 
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N 

1 

2 

3 

4 

5 

Ratio of surface cost of SEM vs GL 

1.1243 

1.0597 

0.9350 

0.8986 

0.8329 

Ratio of update cost of SEM vs GL 

0.9884 

0.7437 

1.0000 

0.8408 

0.9000 

Ratio of total cost of SEM 

vs GL 

1.0432 

0.9286 

0.9798 

0.9163 

0.9134 

Table 5: Total time-per-dof cost for SEM vs 

GL hexahedra. 


N 

1 

2 

3 

4 

5 



Hex SEM kernels 




Volume 

1.2238 

1.0373 

1.1147 

0.9538 

0.7772 



Surface 

0.9631 

0.8735 

0.5666 

0.6038 

0.5808 



Update 

1.0254 

0.8427 

0.9161 

0.8762 

0.6667 



Total (all kernels) 

1.0492 

0.9108 

0.8222 

0.7987 

0.6738 



Hex GL kernels 




Volume 

1.2238 

1.0373 

1.1147 

0.9538 

0.7772 



Surface 

0.8567 

0.8243 

0.6060 

0.6719 

0.6974 



Update 

1.0374 

1.1332 

0.9161 

1.0422 

0.7407 



Total (all kernels) 

1.0058 

0.9808 

0.8392 

0.8717 

0.7377 



Wedge kernels 




Volume 

2.965 

1.800 

2.999 

2.596 

2.975 



Surface 

0.912 

0.778 

0.99 

2.086 

1.566 



Update 

1.414 

1.02 

1.111 

1.321 

1.485 



Total (all kernels) 

1.5937 

1.1450 

1.5972 

2.0407 

2.0242 



Pyramid kernels 




Volume 

1.912 

1.653 

2.418 

3.228 

2.842 



Surface 

0.959 

0.855 

0.634 

1.657 

1.157 



Update 

1.51 

1.033 

1.112 

1.169 

1.804 



Total (all kernels) 

1.3705 

1.1386 

1.2788 

2.0470 

1.9280 



Table 6: Time-per-dof cost (relative to tetrahedra) of hexahedra, wedge and pyramid kernels. 

4-3.2. Cost per element type 

In this section, we compare the computational cost of hexahedra, wedges, and pyramids relative to the 
computational cost of tetrahedra. The results reported are for tuned computational kernels, where the 
number of elements processed per workgroup has been chosen in order to minimize the runtimes of the 
volume, surface, and update kernels for each element [31117]. As suggested in |2S], automation of this 
process is crucial for portable performance across various architectures, especially for hybrid meshes where 
parameters must be tuned for 12 separate kernels. 

Tableshows the time-per-dof cost for K « 100000 hexahedral, wedge, or pyramidal elements relative to 
the time-per-dof cost oi K k, 100000 tetrahedral elements. While hexahedra are observed to be faster per-dof 
than tetrahedra at higher orders, wedge and pyramidal volume kernels are observed to be 1-2 times slower 
per-dof than tetrahedra kernels. This may be due in part to the fact that, for planar tetrahedra, geometric 
factors are constant over an element. As a result, only a few values must be loaded per element independently 
of N. This is in contrast to hexahedra, wedges and pyramids, for which the number of geometric factors to 
be loaded increases along with the total number of nodal or cubature points. 

We note that these numbers give optimistic estimates on per-dof costs for two reasons. First of all, for 
meshes containing only planar tetrahedra, the cost of the surface and update kernels may be further reduced 
by replacing surface quadratures with lift operators for nodal bases |22L I2B) . Secondly, on hex-dominant 
meshes, the number of wedges and pyramids is expected to be much smaller than the number of hexahedra 
and tetrahedra. The runtimes reported previously are for asymptotically “large” numbers of elements, and 
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Figure 10: Time-per-dof cost of hexahedral, wedge, pyramidal, and tetrahedral volume kernels on meshes 
with K « 100,1000,10000 and 30000 elements at various orders N. 


may increase for a small number of elements. 

To illustrate the dependence of runtime on number of elements, we timed the cost of the volume kernel 
over 500 timesteps on meshes containing approximately 100,1000,10000, and 30000 elements. For ease of 
presentation, we show results only for the volume kernel, though the behavior is similar for the surface and 
update kernels. The order of approximation is varied from N = 1,..., 5 for each mesh, and the kernel 
runtime is reported in Figure |10| Since the computational structure of each volume kernel is the same 
between the SEM and GL formulations (with the exception of the pyramicQ we show timings for only four 
volume kernels. 

Since the time-per-dof decreases as more elements are processed, if a mesh contains a small number of 
wedges or pyramids, the actual cost of a wedge or pyramid relative to a tetrahedron may be slightly greater 
than the reported values of Table though this effect is less pronounced at higher orders of approximation. 
For example, for N = 3 on the largest hybrid mesh (consisting of 113664 elements, with 5120 prisms, 1536 
pyramids, and 64000 tetrahedra), prisms and pyramids were respectively 1.7563 and 1.8849 times more 
expensive per dof than tetrahedra, instead of the estimated 1.6 and 1.28 times suggested by Table 


4-. 3.3. Estimated GFLOPS and bandwidth 

Finally, we present estimated GFLOP and effective bandwidth counts for each kernel in Table For 
reference, we include the estimated bandwidth and GFLOPS of a repeated entrywise multiplication kernel 
in Figure 11 For an array A of sufhciently large size (running with the maximum number of threads), 
we execute A^muit times the command Aij = — C, where C is some constant. As fVmuit increases, the 

kernel changes from being 10 bound to compute-bound. We achieve roughly 1970 GFLOPS and 168 GB/s 
bandwidth at most on an Nvidia GTX 980. When utilizing float4 operations, we observe the same peak 
bandwidth, but achieve 2930 GFLOPS at peak, roughly 1.5 x the performance using only floats. While these 
numbers are not necessarily indicative of peak performance, we believe they are representative of “good” 
performance for a given kernel. 

With this in mind, we may compare the GFLOPS and bandwidth of kernels for each element. Comparing 
only the volume kernels, we see efficiency resulting from both effective hardware utilization and reduced 
computational load. For example, the wedge volume kernel shows low estimated bandwidth with high 
estimated GFLOPS, indicating that its performance is compute-bound. We note that the observed GFLOPS 
are relatively high, which may be due to the heavy use of f loat4 operations within the wedge volume kernel. 
On the other hand, the bandwidth and GFLOPS for the tetrahedron volume kernel are more balanced. 
Finally, though it is similar in structure to the tetrahedral kernel, the pyramid volume kernel shows high 
estimated bandwidth and low GFLOPS, implying that it is 10 bound. This is likely due to the fact that the 
skew-symmetric form is used, and requires the loading of twice as many operators (derivative matrices and 


®Under the GL formulation, the pyramid volume kernel may use the more efficient strong formulation. We show timings, 
GFLOPS, and estimated bandwidth for the skew-symmetric pyramid kernel only. 
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(a) Standard float operations 


(b) Float4 operations 


Figure 11: GFLOPS and bandwidth (for a kernel executing repeated entrywise array multiplication/addition) 
as a function of the number of repeated multiplications As A^muit increases, the kernel goes from being 

10 bound to compute-bound. Results from two kernels are presented: one where computations are done 
using only standard single precision floats, and one where computations are performed using the float! data 
type. 


their transposes) as the tetrahedral kernel. 

For hexahedral kernels, the operators are explicitly stored in shared memory, removing caching effects. 
As a result, the reported bandwidth and GFLOPS are both significantly lower than for other elements, 
but near-peak bandwidth numbers are achieved for N > S, implying the kernel is 10 bound. This near¬ 
peak performance is likely due to the loading of multiple geometric change-of-variables factors per node per 
element and the low arithmetic intensity of tensor-product operations, which are effectively hidden during 
memory retrievals. It has been noted in |89j that, for vertex-mapped hexahedra, the 10 bound nature of 
the hex volume kernel may be addressed by computing these geometric factors on the fly inside the kernel, 
loading only vertex positions (regardless of order). 

For wedges, pyramids, and tetrahedra, bandwidth is reported both with and without counting operator 
loads per element. The reported bandwidth numbers with operator loads are often higher than peak rates, 
which is likely due to the fact that compiler optimizations and caching effects are not taken into account 
during estimates. Removing operator loads from bandwidth counts reveals that the bandwidth for prism, 
pyramid, and tetrahedron kernels decreases as N increases, verifying that operations for these three element 
types are compute-bound at high orders. 

5. Conclusions 

We have presented in this paper a high order discontinuous Galerkin solver on hybrid, hex-dominant 
meshes. A careful selection of basis functions controls the storage cost of the scheme, making it suitable for 
acclerators and GPUs. Two energy-stable formulations are given, based on either SEM or Gauss-Legendre 
nodal bases for the hexahedron. Explicit bounds on the spectra are computed in terms of the order-dependent 
constants in the trace inequalities for each type of element, and are used to determine local timestep restric¬ 
tions for multi-rate timestepping. Convergence rates are reported for each formulation, and computational 
cost is assessed by estimating the GFLOPS, bandwidth, and cost-per-dof of each element. 

The focus of this work has been mainly based presenting a DG solver for hybrid meshes on a single GPU. 
Future work will focus on the following areas: 

• Wedge basis: while the LSC-DG wedge basis achieves optimal rates of convergence, there is additional 
complexity and computational cost compared to polynomial bases (necessity of quadrature, storage of 
derivatives of J). A low-storage polynomial wedge basis could reduce costs for wedge volume kernels 
and surface/update kernels for wedges, pyramids, and tetrahedra. 
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N 

1 

2 

3 

4 

5 

Hex SEM 


Volume 

100 

134 

173 

199 

232 

Surface 

68 

51 

57 

47 

35 

Update 

46 

34 

49 

40 

44 

Hex GL 


Surface 

60 

44 

52 

43 

32 

Update 

45 

46 

49 

48 

48 

Wedge 


Volume 

242 

552 

1032 

1551 

1956 

Surface 

197 

418 

566 

452 

771 

Update 

166 

403 

736 

904 

960 

Pyramid 


Volume 

77 

161 

238 

259 

392 

Surface 

187 

344 

850 

482 

922 

Update 

136 

333 

699 

898 

670 

Tet 


Volume 

123 

200 

401 

550 

699 

Surface 

149 

228 

423 

596 

791 

Update 

161 

260 

599 

779 

931 


1 

2 

3 

4 

5 


152 

160 

169 

165 

167 

88 

68 

81 

70 

54 

159 

112 

155 

125 

133 


77 

59 

74 

64 

49 

160 

152 

155 

149 

148 


217/149 

166/103 

177/116 

271/156 

273/98 

309/132 

294/122 

328/70 

472/128 

316/107 

247/31 

533/99 

300/79 

408/33 

536/69 


220/102 

163/107 

155/103 

444/106 

236/100 

270/123 

646/83 

510/131 

464/137 

699/52 

271/46 

546/116 

1054/49 

497/55 

397/58 


238/145 

122/81 

205/147 

322/125 

147/65 

238/129 

595/139 

244/64 

434/160 

784/116 

327/58 

509/144 

974/95 

421/50 

559/112 


(a) GFLOPS (b) Est. bandwidth 

Table 7: GFLOPS and estimated effective bandwidth counts for each kernel of each element type (after 
optimization of the number of elements processed per workgroup). For wedges, pyramids, and tetrahedra, 
bandwidth is reported (with operator loads)/(without operator loads). 


• Hybrid meshes compared to tetrahedral meshes: while cost-per-dof comparisons have been 
made between tetrahedra and other element types, DG on hex-dominant meshes should be compared 
to optimized nodal DG on fully tetrahedral meshes. The effect on efficiency of the ratio between hexes 
and elements of other types should also be carefully investigated. 

• Parallelization and scalability: multiple GPUs are necessary to handle larger problem sizes. Per¬ 
formance of multiple GPU-accelerated DG will require load balancing and communication strategies 
for multi-rate solvers on hybrid meshes. 
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Appendix A. Explicit trace inequalities for tensor product elements 


The constant in the surface trace inequality for tensor product elements may be computed explicitly. 
Throughout these proofs, we define /j be the normalized Legendre polynomials, orthogonal with respect to 
the standard lS{[—l, 1]) inner product. We begin by proving a result in ID. 

Lemma Appendix A.l. Let u € P^{[—1,1]). Then, 


K-i)| + Ki)|< 


{N + l){N + 2) 
2 


M 


2 

LHl-Li]) 


Proof. In ID, we wish to derive an expression for the spectral radius p{Ms), where 


{Ms)ij = /ji-l)//-!) + (j)j{l)(j)^{l). 

and <i)j{x) are normalized Legendre polynomials. Since Legendre polynomials are symmetric across x = 0, 
^j(—1) = 4'j{l) for j even, and <j)j{—l) = —/j)!) for j odd. Then, assuming i and j do not share the same 
parity, we have 

{Ms)^j = /j{-l)//-l) + /j{l)M^) = 0 . 
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This implies Ms is block diagonal, with each block corresponding to even/odd powers of the Legendre 
polynomial. Both the even block M® and odd block M° reduce down to 


These are both rank 1 matrices, such that 

p(M®) = 2 ^ </),(-l)2, p(M°)=2 ^ 

j—even j—odd 

For both even and odd i,j, = (2* + l)/2, implying that 

p(M,) = 2max(p(M®),p(M°)) = (^ + + 2) ^ 

□ 


The extremal polynomial for the above inequality may also be characterized explicitly. 

Lemma Appendix A.2. Define ri to be the quadrature points of the {N + 2)-point Gauss-Legendre-Lobatto 
(GLL) rule. Let u be the order {N + 1) Lagrange polynomial that is zero at N interior GLL nodes and unity 
at either r = l orr = —1. Then, 


K-i)| + |u(i)| 


(A+l)(A + 2) 
2^ 


M 


2 

LHl-hi]) 


Proof. We note that, while the {N + l)-point GLL quadrature does not integrate polynomials of order 2N 
exactly, polynomials of order 2N are integrated exactly by the {N + 2)-point GLL quadrature. 

Since u{r) is symmetric or antisymmetric across r = 0 (this may be shown by explicitly writing out the 
interpolating Lagrange polynomial), m(1) = ±m(— 1) depending on whether N is even or odd. Then, ||m ||^2 
is given as 


nl ^ + 2 

lkllL2 = / WjU(rj) = Wiu{-lf + Wn+2u{IY = m(1)^(wi + Wn+2)- 

“'-1 j=l 

The surface integral of is just u(l)^ + w(—1)^ = 2u(l)^. As a result, 

2u{lf _ ^ _ (A + l)(A + 2) 

u{IY{wi-^wn+ 2) wi 2 

by the fact that the GLL weights for an (A + 2)-point rule are each at the endpoints. □ 

A curious coincidence is that, since the A + 1-point Gauss-Radau-Jacobi (GRJ) rule contains all but one 
endpoint of the (A -|- 2)-point GLL rule, the extremal polynomial is identical to the extremal polynomial for 
triangle faces, which is unity at the —1 GRJ point and zero at all others m- 

We may prove a similar result, replacing the mass matrix with the underintegrated Gauss-Legendre- 
Lobatto quadrature mass matrix. 

Lemma Appendix A. 3. Let u G P^([—1,1]). Then, for N > 1, the following is bound is tight 
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Proof. We take the basis (j)i{r) = £i{r)/^/wi, where £i{r) is the Lagrange polynomial at the ith GLL node 
and Wi are the quadrature weights of the {N + l)-point GLL rule. Since this basis is orthogonal with respect 
to the SEM inner product, the constant in the trace inequality is the spectral radius of the surface mass 
matrix under this basis, which has entries (Mg)y = Applying the Lagrange 

property, we see that is diagonal with only two nonzero diagonal entries 

= —, {Ms)n+i,n+i = -■ 

Wi Wn+1 

p{Ms) is the maximum of these two entries, and the bound is proven by noting wi = wn+i = jv(jv+i) • 
extremal polynomial for this bound is any convex combination of 4>i{r) or □ 


Appendix A.l. Tensor product elements 

For a d-dimensional hypercube [—1,1]'^, we may define an orthonormal tensor product basis as 


4'iki'^k), 0 < ifc < N, 


k=l 


where is the coordinate in the fcth direction. Using the above basis, we may prove the following theorem: 
Theorem Appendix A. 4. Let u G span for 0 < ii,.. < N. Then, 


II l|2 ^ + 1)(-^ + 2) II II 


Furthermore, this bound is tight. 

Proof. The surface mass matrix over the d-dimensional hypercube is 

d 


d .1 

= 'y ' {4>ik (~1) A 4'ik (1)) TT / 

k=l 


k=l 

By the orthogonality of Legendre polynomials, this reduces to 

d 




fc=l 

l^k 

This results in a block diagonal matrix in all indices ii,ji for I k. Also, by the same argument as in the 
ID case, if pairs of indices ik,jk do not share the same parity. 


implying that each matrix consists of two blocks of even and odd indices. Additionally, each matrix 
Mg contains the same entries as the ID surface mass matrix and are identical up to a permutation. An 
application of Weyl’s inequality then gives 


d 

p{Mg)<Y,p{M^) 


k=l 


^{N + l){N + 2) 
2 
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We may show that this bound is tight by constructing the extremal polynomial in d dimensions as the 
tensor product of the ID polynomial. Take uiuifk) to be the ID extremal polynomial in the fcth coordinate 
and define 

d 

w(ri, ...^rd) = Y\ 

k^l 

The (A^+2)^ tensor product GLL rule integrates this exactly, such that for the reference element K = [—1,1]'^, 

7V+2 N+2 N+2 

*1=1 *2 = 1 id = l 

where 2*^ is the number of vertices on the d-dimensional hypercube. 

The surface norm of u for a d-dimensional hypercube is the sum of the integrals over the 2d faces. Since 
each face itself is a hypercube of dimension d — 1, the integral of u over the face is 2‘^~^wi, and the surface 
norm of u reduces to 


2d 

/=i 

and we may conclude that 

\ML^{dK) _ d _^{N + \){N + 2) 

MI^k) ~ 2 

□ 


For > 1, similar steps may be followed to prove the equivalent trace inequality under SEM quadrature 
in d dimensions 

„ „2 N{N + 1) ,, ,, 

We note that Ern and Burman also utilized GLL weights to prove a similar trace inequality in d dimensions 




{N + l){N + 2) 




\l^(k) 


The additional factor of (2 -|- 1/N)‘^ results from the equivalence constant between the norm and discrete 
norm resulting from underintegration using GLL quadrature. 


Appendix B. Computed Markov and trace inequality constants 

For wedge, pyramid, and tetrahedral elements, we have not been able to determine explicit expressions 
for the constants Ct{N) in trace inequalities over reference elements K 

— Ct{N) IIwII• 

However, Ct{N) may be computed as the maximum eigenvalue of a generalized eigenvalue problem 

MgV = XMv, 

where Mg is the surface mass matrix and M is the volume mass matrix. We may similarly compute the trace 
inequality constants Csem(.^) for the case when SEM quadrature is used for hexahedra and quadrilateral 
faces. For the Markov inequality on the reference element 
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N 

1 

2 

3 

4 

5 

6 

7 

8 

9 

Hexahedron 

9 

18 

30 

45 

63 

84 

108 

135 

165 

Wedge 

9.93 

18.56 

29.03 

42.99 

58.80 

78.01 

99.27 

123.76 

150.48 

Pyramid 

11.68 

20.89 

32.84 

47.59 

65.17 

85.60 

108.90 

135.07 

164.11 

Tetrahedron 

12.22 

20.46 

29.18 

41.65 

54.45 

71.10 

88.32 

109.04 

130.67 


Table B.8: Computed trace constants Ct{N) using the full mass matrix for various reference elements up 
to N = 9. 


N 

1 

2 

3 

4 

5 

6 

7 

8 

9 

Hexahedron 

3 

9 

18 

30 

45 

63 

84 

108 

135 

Wedge 

46.46 

51.70 

63.94 

83.32 

104.76 

132.20 

161.22 

195.95 

232.52 

Pyramid 

31.58 

37.12 

47.59 

60.95 

77.14 

96.31 

118.52 

143.80 

172.13 


Table B.9: Computed trace constants C'sem(-^) using SEM quadrature for various reference elements up to 
N = 9. SEM quadrature is used to compute the mass matrix and surface mass matrices when quadrilateral 
faces are present. The constants for tetrahedra are identical in both cases. 


we may similarly compute the constant Cm{N) as the maximum eigenvalue of 


Kv = XMv, 


where K^j = f- 


• V(/)i is the stiffness matrix over the reference element. We summarize computed 
values of Ct{N), CsEuiN) and Cm{N) up to fV = 9 in Tables B.8 B.9 and B.IO For a SEM hex, we may 
determine the constant in the Markov inequality using the full mass matrix value of Cm{N) and equivalence 
constants between discrete SEM and inner products. 


N 

1 

2 

3 

4 

5 

6 

7 

8 

9 

Hexahedron 

3.00 

18.00 

55.73 

137.51 

293.97 

562.17 

985.92 

1616.24 

2511.47 

Wedge 

12.00 

54.27 

142.63 

308.34 

585.89 

1021.64 

1663.85 

2574.06 

3814.56 

Pyramid 

12.92 

60.05 

175.51 

405.43 

809.95 

1460.93 

2442.26 

3849.94 

5792.11 

Tetrahedron 

20.00 

78.62 

195.58 

403.91 

744.85 

1265.54 

2021.09 

3073.26 

4491.62 


Table B.IO: Computed Markov constants Cm{N) for various reference elements up to iV = 9. 
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